OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation

Yu, Jinzheng, Xu, Yang, Li, Haozhen, Li, Junqi, Feng, Yifan, Zhu, Ligu, Shen, Hao, Shi, Lei

arXiv.org Artificial Intelligence

Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models have made automated report generation technically feasible, systematic research in this specific area remains notably absent, particularly lacking formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-GEN) task and construct OPOR-BENCH, an event-centric dataset covering 463 crisis events with their corresponding news articles, social media posts, and a reference summary. To evaluate report quality, we propose OPOR-EVAL, a novel agent-based framework that simulates human expert evaluation by analyzing generated reports in context. Experiments with frontier models demonstrate that our framework achieves high correlation with human judgments. Our comprehensive task definition, benchmark dataset, and evaluation framework provide a solid foundation for future research in this critical domain.


TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

Park, Jiho, Song, Jongyoon, Choi, Minjin, Heo, Kyuho, Huh, Taehun, Kim, Ji Won

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark), a novel benchmark specifically designed for LLM-based productivity assistants. TRUEBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria to capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliability in evaluation, we refined constraints using an LLM validator. Extensive experiments demonstrate that TRUEBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like OpenAI o1 achieved only a 69.07% overall pass rate. TRUEBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.


Insufficient Statistics Perturbation: Stable Estimators for Private Least Squares

Brown, Gavin, Hayase, Jonathan, Hopkins, Samuel, Kong, Weihao, Liu, Xiyang, Oh, Sewoong, Perdomo, Juan C., Smith, Adam

arXiv.org Machine Learning

We present a sample- and time-efficient differentially private algorithm for ordinary least squares, with error that depends linearly on the dimension and is independent of the condition number of $X^\top X$, where $X$ is the design matrix. All prior private algorithms for this task require either $d^{3/2}$ examples, error growing polynomially with the condition number, or exponential time. Our near-optimal accuracy guarantee holds for any dataset with bounded statistical leverage and bounded residuals. Technically, we build on the approach of Brown et al. (2023) for private mean estimation, adding scaled noise to a carefully designed stable nonprivate estimator of the empirical regression vector.
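
As a point of reference for the quantities named in this abstract, the following is a minimal nonprivate sketch, not the authors' algorithm. It computes the ordinary least squares vector, the residuals, the statistical leverage scores $h_i = x_i^\top (X^\top X)^{-1} x_i$, and the condition number of $X^\top X$. The function name `ols_diagnostics` and the synthetic data are assumptions made for illustration; no privacy noise is added here.

```python
import numpy as np


def ols_diagnostics(X, y):
    """Nonprivate OLS fit plus leverage scores and the Gram condition number.

    Illustrative helper (hypothetical name); it adds no differential privacy.
    """
    gram = X.T @ X                          # X^T X, shape (d, d)
    beta = np.linalg.solve(gram, X.T @ y)   # solves the normal equations
    residuals = y - X @ beta
    # Leverage of row i: h_i = x_i^T (X^T X)^{-1} x_i
    gram_inv_xt = np.linalg.solve(gram, X.T)            # shape (d, n)
    leverage = np.einsum("ij,ij->i", X, gram_inv_xt.T)  # shape (n,)
    cond = np.linalg.cond(gram)
    return beta, residuals, leverage, cond


# Synthetic example data (assumed, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=200)

beta, residuals, leverage, cond = ols_diagnostics(X, y)
print("estimate:", beta)
print("max leverage:", leverage.max(), "max |residual|:", np.abs(residuals).max())
print("condition number of X^T X:", cond)
```

The leverage scores sum to the dimension $d$, so "bounded leverage" means no single example carries much more than its $d/n$ share of influence on the fit.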


PAGER: A Framework for Failure Analysis of Deep Regression Models

Thiagarajan, Jayaraman J., Narayanaswamy, Vivek, Trivedi, Puja, Anirudh, Rushil

arXiv.org Machine Learning

Safe deployment of AI models requires proactive detection of potential prediction failures to prevent costly errors. While failure detection in classification problems has received significant attention, characterizing failure modes in regression tasks is more complicated and less explored. Existing approaches rely on epistemic uncertainties or feature inconsistency with the training distribution to characterize model risk. However, we show that uncertainties are necessary but insufficient to accurately characterize failure, owing to the various sources of error. In this paper, we propose PAGER (Principled Analysis of Generalization Errors in Regressors), a framework to systematically detect and characterize failures in deep regression models. Built upon the recently proposed idea of anchoring in deep models, PAGER unifies both epistemic uncertainties and novel, complementary non-conformity scores to organize samples into different risk regimes, thereby providing a comprehensive analysis of model errors. Additionally, we introduce novel metrics for evaluating failure detectors in regression tasks. We demonstrate the effectiveness of PAGER on synthetic and real-world benchmarks. Our results highlight the capability of PAGER to identify regions of accurate generalization and detect failure cases in out-of-distribution and out-of-support scenarios.


Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation

Mahajan, Divyat, Mitliagkas, Ioannis, Neal, Brady, Syrgkanis, Vasilis

arXiv.org Artificial Intelligence

We study the problem of model selection in causal inference, specifically for the case of conditional average treatment effect (CATE) estimation under binary treatments. Unlike model selection in machine learning, there is no perfect analogue of cross-validation, as we do not observe the counterfactual potential outcome for any data point. To address this, a variety of proxy metrics have been proposed in the literature that depend on auxiliary nuisance models estimated from the observed data (propensity score model, outcome regression model). However, the effectiveness of these metrics has only been studied on synthetic datasets, for which the counterfactual data are accessible. We conduct an extensive empirical analysis to judge the performance of these metrics introduced in the literature, and novel ones introduced in this work, where we utilize the latest advances in generative modeling to incorporate multiple realistic datasets. Our analysis suggests novel model selection strategies based on careful hyperparameter tuning of CATE estimators and causal ensembling.


Using Time-Series Privileged Information for Provably Efficient Learning of Prediction Models

Karlsson, Rickard, Willbo, Martin, Hussain, Zeshan, Krishnan, Rahul G., Sontag, David, Johansson, Fredrik D.

arXiv.org Machine Learning

We study prediction of future outcomes with supervised models that use privileged information during learning. The privileged information comprises samples of time series observed between the baseline time of prediction and the future outcome; this information is available only at training time, which differs from traditional supervised learning. Our question is when using this privileged data leads to more sample-efficient learning of models that use only baseline data for predictions at test time. We give an algorithm for this setting and prove that when the time series are drawn from a non-stationary Gaussian-linear dynamical system of fixed horizon, learning with privileged information is more efficient than learning without it. On synthetic data, we test the limits of our algorithm and theory, both when our assumptions hold and when they are violated. On three diverse real-world datasets, we show that our approach is generally preferable to classical learning, particularly when data is scarce. Finally, we relate our estimator to a distillation approach both theoretically and empirically.


Improving probability selecting based weights for Satisfiability Problem

Fu, Huimin, Xu, Yang, Liu, Jun, Wu, Guanfeng, Sutcliffe, Geoff

arXiv.org Artificial Intelligence

The Boolean Satisfiability problem (SAT) is important to the artificial intelligence community, and solving it has an impact on many complex problems. Recently, great breakthroughs have been made on stochastic local search (SLS) algorithms for uniform random k-SAT, resulting in several state-of-the-art SLS algorithms (Score2SAT, YalSAT, ProbSAT, CScoreSAT), and on a hybrid algorithm for hard random SAT (HRS), resulting in the state-of-the-art hybrid algorithm SparrowToRiss. However, there is no algorithm that can effectively solve both uniform random k-SAT and HRS. In this paper, we present a new SLS algorithm named SelectNTS for uniform random k-SAT and HRS. SelectNTS is an improved probability-selection-based local search algorithm for the SAT problem. The core of SelectNTS relies on new clause and variable selection heuristics. The new clause selection heuristic uses a new clause weighting scheme and a biased random walk. The new variable selection heuristic uses a probability selecting strategy with a variation of the CC strategy based on a new variable weighting scheme. Extensive experimental results on the well-known random benchmark instances from the SAT Competitions in 2017 and 2018, and on randomly generated problems, show that our algorithm outperforms state-of-the-art random SAT algorithms and that SelectNTS can effectively solve both uniform random k-SAT and HRS.


Refined bounds for algorithm configuration: The knife-edge of dual class approximability

Balcan, Maria-Florina, Sandholm, Tuomas, Vitercik, Ellen

arXiv.org Artificial Intelligence

Automating algorithm configuration is growing increasingly necessary as algorithms come with more and more tunable parameters. It is common to tune parameters using machine learning, optimizing performance metrics such as runtime and solution quality. The training set consists of problem instances from the specific domain at hand. We investigate a fundamental question about these techniques: how large should the training set be to ensure that a parameter's average empirical performance over the training set is close to its expected, future performance? We answer this question for algorithm configuration problems that exhibit a widely-applicable structure: the algorithm's performance as a function of its parameters can be approximated by a "simple" function. We show that if this approximation holds under the L-infinity norm, we can provide strong sample complexity bounds. On the flip side, if the approximation holds only under the L-p norm for p smaller than infinity, it is not possible to provide meaningful sample complexity bounds in the worst case. We empirically evaluate our bounds in the context of integer programming, one of the most powerful tools in computer science. Via experiments, we obtain sample complexity bounds that are up to 700 times smaller than the previously best-known bounds.
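
To make the training-set-size question concrete, here is a small sketch of the simplest possible case: a single fixed parameter setting whose performance on each instance lies in a bounded range. It uses the standard Hoeffding inequality, not the bounds derived in the paper, and it does not hold uniformly over all parameter settings, which is the harder problem the paper addresses. The function name `hoeffding_sample_size` and its arguments are assumptions made for illustration.

```python
import math


def hoeffding_sample_size(epsilon, delta, perf_range=1.0):
    """Training instances sufficient for ONE fixed parameter setting.

    With performance values in an interval of width `perf_range`, Hoeffding's
    inequality gives P(|empirical mean - expectation| >= epsilon)
    <= 2 * exp(-2 * n * epsilon**2 / perf_range**2).
    Solving for n so that the right-hand side is at most `delta` yields
    n >= perf_range**2 * ln(2 / delta) / (2 * epsilon**2).
    """
    if not (epsilon > 0 and 0 < delta < 1 and perf_range > 0):
        raise ValueError("need epsilon > 0, 0 < delta < 1, perf_range > 0")
    return math.ceil(perf_range**2 * math.log(2 / delta) / (2 * epsilon**2))


# Example: performance normalized to [0, 1], accuracy 0.1, confidence 95%.
n_needed = hoeffding_sample_size(epsilon=0.1, delta=0.05)
print(n_needed)
```

The required sample size grows with $1/\epsilon^2$, so halving the tolerated gap between empirical and expected performance roughly quadruples the number of training instances in this single-parameter setting.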


Learning to Branch

Balcan, Maria-Florina, Dick, Travis, Sandholm, Tuomas, Vitercik, Ellen

arXiv.org Artificial Intelligence

Tree search algorithms, such as branch-and-bound, are the most widely used tools for solving combinatorial and nonconvex problems. For example, they are the foremost method for solving (mixed) integer programs and constraint satisfaction problems. Tree search algorithms recursively partition the search space to find an optimal solution. In order to keep the tree size small, it is crucial to carefully decide, when expanding a tree node, which question (typically variable) to branch on at that node in order to partition the remaining space. Numerous partitioning techniques (e.g., variable selection) have been proposed, but there is no theory describing which technique is optimal. We show how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution. We provide the first sample complexity guarantees for tree search algorithm configuration. These guarantees bound the number of samples sufficient to ensure that the empirical performance of an algorithm over the samples nearly matches its expected performance on the unknown instance distribution. This thorough theoretical investigation naturally gives rise to our learning algorithm. Via experiments, we show that learning an optimal weighting of partitioning procedures can dramatically reduce tree size, and we prove that this reduction can even be exponential. Through theory and experiments, we show that learning to branch is both practical and hugely beneficial.


Her2 Challenge Contest: A Detailed Assessment of Automated Her2 Scoring Algorithms in Whole Slide Images of Breast Cancer Tissues

Qaiser, Talha, Mukherjee, Abhik, Pb, Chaitanya Reddy, Munugoti, Sai Dileep, Tallam, Vamsi, Pitkäaho, Tomi, Lehtimäki, Taina, Naughton, Thomas, Berseth, Matt, Pedraza, Aníbal, Mukundan, Ramakrishnan, Smith, Matthew, Bhalerao, Abhir, Rodner, Erik, Simon, Marcel, Denzler, Joachim, Huang, Chao-Hui, Bueno, Gloria, Snead, David, Ellis, Ian, Ilyas, Mohammad, Rajpoot, Nasir

arXiv.org Artificial Intelligence

Evaluating expression of the Human epidermal growth factor receptor 2 (Her2) by visual examination of immunohistochemistry (IHC) on invasive breast cancer (BCa) is a key part of the diagnostic assessment of BCa due to its recognised importance as a predictive and prognostic marker in clinical practice. However, visual scoring of Her2 is subjective and consequently prone to inter-observer variability. Given the prognostic and therapeutic implications of Her2 scoring, a more objective method is required. In this paper, we report on a recent automated Her2 scoring contest, held in conjunction with the annual PathSoc meeting in Nottingham in June 2016, aimed at systematically comparing and advancing the state-of-the-art Artificial Intelligence (AI) based automated methods for Her2 scoring. The contest dataset comprised digitised whole slide images (WSI) of sections from 86 cases of invasive breast carcinoma stained with both Haematoxylin & Eosin (H&E) and IHC for Her2. The contesting algorithms automatically predicted scores of the IHC slides for an unseen subset of the dataset, and the predicted scores were compared with the 'ground truth' (a consensus score from at least two experts). We also report on a simple Man vs Machine contest for the scoring of Her2 and show that the automated methods could beat the pathology experts on this contest dataset. This paper presents a benchmark for comparing the performance of automated algorithms for scoring of Her2. It also demonstrates the enormous potential of automated algorithms in assisting the pathologist with objective IHC scoring.